This exercise was originally completed in Python, during my completion of Week 4, Case Study 1 of the free access version of Harvard EdX course “Using Python for Research” in the summer of 2022. This translation to R was performed in the Spring of 2023.
The Tidyverse, Janitor, sjmisc, rBokeh and ComplexHeatmap packages were installed prior to generation, and this write-up was created as an Rnotebook with Rstudio.
The use of ComplexHeatmap has a requested citation from the creator:
Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. DOI: 10.1093/bioinformatics/btw313.
Libraries used for this data set were imported.
Rstudio has a feature which makes importation of data from a CSV file to a tibble user friendly. This example write up will skip that step to demonstrate the ability to manually assess imported data.
SuppressMessages was used to suppress default console warning messages that these features were not used.
Bokeh allows data to be visualized with interactive elements that can
render as .html files. The following demonstration is to
show a general use of Bokeh by creating a 5x5 grid composed of two
alternating colors. There is no particular reason to select a 5x5 plot
aside from being an easily manageable size; it is unrelated to our
Whiskies DataSet.
We can indicate the values we want for both our x and y axes, and the
colors we would like to use for our grid. Bokeh takes hex values for
colors. #03f0fc is a bright electric blue, while
#fca103 is a vibrant orange. Selection is based on personal
preference and readability. Since colors can be fully customized, this
provides great adaptability for modification of Bokeh plots for maximum
readability.
We are able to use the built in expand.grid() function
in R to create cartesian coordinates, and we rename our columns to
clarify which values are our xs and which are ys.
We then create all the color values for our Bokeh plot, stored in
bokehColors, using a simple for loop. This
loop iterates over all values in grid, and assigns them
alternating colors from colors using a modulo operator and
indexing.
Finally, we determine our transparency values (alphas)
of each point, where 0 is completely transparent and 1 is entirely
opaque. The built in seq() function of R allows us to
create an evently spaced gradient of values, which will result in
decreasing transparency diagonally from bottom left to top right: the
bottom leftmost value (1,1) will be transparent, while the
top rightmost value (5,5) will be entirely opaque.
Unlike Python, we do not need to declare a ColumnDataSource in R. Rather, we can simply declare and create our figure using R’s built in syntax.
Bokeh can also be utilized to plot using latitude and longitude. It plots similarly to a scatter-plot, which is particularly useful for latitude and longitude naturally taking form in a cartesian (x,y) structure. rBokeh also has the capability to automatically assign a consistent glyph and/or color structure to data points, based on categorical variables. To demonstrate, a dataframe consisting of “X”, “Y”, and “Category” values is generated, and visualized with rBokeh.
Note the change to the configuration of the hover tool: x and y coordinates are still shown when hovering over any data point, but are now labeled as “Location”. Observe how on the rendered visualization, there is no information provided by hover where a data point was not plotted.
The first stage of utilizing any new data set should be exploration of the data included in the set, including its source.
This data set was originally provided as a csv file. We come into the data set with the expectation that each row contains data about a single whiskey produced in Scotland, with one Whiskey per distillery. Each Whiskey is “graded” on various aspects of its flavor profile (though, from the data set itself, we do not know by whom or what specific criteria may have been used for evaluation).
Exploration of the created whiskey DataFrame reveals
that there are 86 included Whiskies, and 18 ‘Columns’ of Data for each
Whiskey.
We can see that the first row appears to indicate column names, so we can simplify our tibble. There is a function for this in the Janitor package.
We use names(whiskey) to retrieve the names of all
columns in the data set, to examine what points of data we expect to be
provided about each whiskey.
We see the first two columns NA and RowID appear to be an arbitrary count.
The third column, Distillery, appears to be the name of the distillery the whiskey came from. From what we expected coming into our dataset, there is one whiskey per distillery. This means that Distillery can serve as our “key”.
We find that all flavors are included in the next 11 columns: Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, and Floral.
We find that the remaining columns describe the geographic location of the distillery that produced each whiskey: Postcode, Latitude, Longitude, and Region.
We can drop both columns that appear to serve as a RowID, since we do not need them. Similarly, we will be generating values for the last column, Group, and can drop the imported values. The Dplyr package has a function for this.
Thanks to our tibble, we may notice that our columns all are storing
their data as characters. This is confirmed by checking the
class for all of our columns.
Luckily Dplyr has a method to quickly convert our columns that contain only numeric values to a numeric data class.
We specify which columns we would like to convert. We store the names of the flavors in flavor, and add Latitude and Longitude to a list of numeric column names.
Since we are primarily interested in the relationship between each Whiskey and its flavor profile, we can create two sub-sets of our data frame that only contains the flavors and Distillery name
How are these flavor profiles characterized? In our initial examination, we noted that there appeared to be integer values in each cell relating to an aspect of the flavor profile.
We can examine these scores in a bit more detail, so we understand how these whiskies have been characterized.
First, we can inspect the maximum values of each flavor. Tobacco appears to have the lowest maximum, with a peak score of 1. No flavor has a score higher than 4.
Inspection of minimum values shows that each flavor has a minimum score of 0 except for Sweetness, which has a minimum score of 1.
Each flavor appears to be scored from 0 - 4, with no half scores. Functionally, this is a discrete numeric variable. While there is no legend provided in the data set, it may be a reasonable assumption to believe that a score of 0 indicates a lack of a flavor, while a score of 4 indicates a strong flavor, with intermediary scores dividing the resulting continuum of intensity.
We can perform some basic summary statistics to explore each of these
flavors further, if desired, using the built-in
summary() function in R. We create a subset of the data
(flavorData) that does not have the
Distillery column to avoid those values being
counted.
We see that every whiskey has a rating for each flavor. The median scores appear similar to means with a cursory overview.
Now that we have a general understanding of the shape and form of our data set, we can explore relationships expressed within it.
First, we’d like to examine relationships between individual flavors that compose each flavor profile.
R has a built in function for Linear Correlations:
cor().
The documentation for the cor() function is worth
examining: we can pass a data frame, and R will perform a pairwise
correlation among columns. The flavorData subset will be
most useful here, as we are comparing each flavor to the other.
We are returned a new data frame. Each column still corresponds to each flavor, but each row indicates which flavor the column flavor was associated with rather than an individual whiskey.
Values of Pearson Correlation Coefficients range from -1 to 1. Negative values indicate a negative association (indicating the flavors are less likely to appear together), while positive values indicate a positive association (indicating flavors are more likely to appear together). A value of 0 indicates no relationship. If each flavor was scored in a binary “yes/no” or 0/1 fashion, we might expect the correlation to only extend to the presence of each flavor, but because the flavors are scored on intensity, the intensity of the flavors impact our correlation as well. As such each cell value is the Pearson Correlation Coefficients of the association between the value of the column flavor, factoring in intensity, to the row flavor, also factoring in intensity. We can interpret this to state that the higher absolute value the Pearson Correlation Coefficient, the stronger the link between the intensity of the two flavors.
We store this new data frame of correlation values as
flavorCorrelations.
It is important to remember that these associations do not imply direction or causality! As such, the association of “Sweetness” to “Honey” is the same as the association of “Honey” to “Sweetness”.
Based on our knowledge of Pairwise Linear Correlations and the
cor() function, we can do a few checks to ensure we have an
output that conforms to our theoretical expectations.
dim() command.This is because flavorCorrelations[1,1] represents the
same flavor associated with itself. As each flavor is
only rated once for each whiskey, it logically follows that the
correlation between scores for any flavor, such as
“Sweetness” to “Sweetness” or
“Spicy” to “Spicy” must be 1:1.
This is because association does not have a direction. As such, the correlation between “Sweetness” and “Honey”, the second and sixth flavors (at index 2 and 6 respectively) should be found to be the same regardless of the direction of comparison.
We check this by comparing the values at [2,6] and [6,2] - they should be identical. And they are, both holding a value of 0.1325581.
While this correlation data set, at 12 by 12, is manageable, this process needs to be scaleable.
The ComplexHeatmap library provides us with a robust tool for creating a heatmap of the correlation values.
As we can see, our Heatmap provides a wealth of
information about our flavors at a glance.
We can check our data against our expectations at a glance: 1. We have an 12 x 12 product, reflecting that each flavor was correlated with each other flavor 2. The diagonal represents consistent values of 1, as each flavor has a 1:1 correlation with itself. 3. The data is mirrored symmetrically across the diagonal, as the associations have no direction.
In any individual flavor profile, it is unlikely for a whiskey to have “Sweetness”, while also being either “Medicinal” or “Smoky”. Conversely, a whiskey with “Body” appears to be more likely to also be considered “Winey”.
This is an interesting overview of all of the whiskies, but what else can we do with this data? #### Pairwise Linear Correlations: Distillery to Distillery (Comparison of Flavor Profiles)
The data takes on a much more interesting form for analysis when we
take it’s transpose. When we take the
transpose of a data frame, we “flip” the data across its
diagonal - rows convert into columns, and vice versa.
In our flavorProfile data set, this organizes each
column into the flavor profile of each Distillery’s product. We
can then take the Linear Pairwise Correlations of each column (each
distillery’s flavor profile) to find the relationship between the
flavor profiles of each distillery.
This quantifies how similar the flavor profile of a “Tullibardine” whiskey is to, say, an “Aberfeldy” whiskey. Which could be quite useful. Do you have anyone who is partial to a very particular whiskey, and worried that you might not be able to get that exact one? Or perhaps you are really partial to a particular whiskey, but cautious about branching out to others given the price tag? This allows us to quantify our best “back up!”
We can, and should, do a few basic checks for our transformed data, such as ensuring that the shape is the expected 12 by 86, and checking the first few rows to ensure they take the expected form of flavors for rows and the index for each distillery marking columns.
We see we have to make a few modifications to clean up the new data set, so we perform those and recheck.
We see that the data does follow our expected form: there are 11 flavors represented for 86 whiskies, each flavor is rated from 0-4. The transpose appears to have not had any issues, so we can continue with our analysis.
We might be tempted to analyze this data in the same way as we did the correlation of flavors, but there’s a catch: we now have 86 rows by 12 columns. It becomes much less reasonable to check any given row manually, simply due to the size of the data.
In addition, each flavor profile is now composed of 12 flavors - the mean, standard deviation, and other summary statistics make much less meaningful sense when taken across a flavor profile. We examine the summary statistics for the whiskey from “Tullibardine”.
What does a mean of 1.25 truly indicate in this context? Or an IQR of 0.75-2.00? The minimum value of 0 and maximum value of 3 indicate that the most “intense” flavor is fairly strong and some may be absent, but we get no real understanding of the flavor profile from this at face value.
Now we can create our Linear Correlation Matrix, similarly to how we correlated across individual flavors. This generates the correlation between the flavor profiles of each Distillery.
As a reminder:
We pass a data frame (distilleryProfiles), and R will
perform a pairwise correlation among columns, with each column
representing the entire flavor profile of a single whiskey.
We are returned a new data frame. Each column still corresponds to each distillery’s flavor profile, but each row indicates which distillery’s flavor profile the column distillery’s flavor profile was associated with rather than an individual whiskey. This association calculates for every aspect of the flavor profile, not just the aggregate “mean” scores.
This allows the computer to perform and return a much more robust association than could be easily performed by hand. For two distilleries to be associated, it is not enough that they have similar total values for all flavors, but that they have related intensity values for each individual flavor that composes that flavor profile.
Values of Pearson Correlation Coefficients range from -1 to 1. Negative values indicate a negative association (indicating the flavor profiles of the distilleries are less similar), while positive values indicate a positive association (indicating flavor profiles of the distilleries are more similar). A value of 0 indicates no relationship. If two distilleries create “opposite” flavor profiles, we would expect them to have a negative association, not no association.
We can interpret this to state that the higher absolute value the Pearson Correlation Coefficient, the stronger the similarity of the overall flavor profiles between the whiskies produced at both distilleries. If you are looking for your “backup” or “next best,” look for a strong positive association with a known whiskey you like. If you are looking for something to avoid, look for a strong negative association with a known whiskey you like.
We store this new DataFrame of correlation values as
distilleryCorrelations.
It is important to remember that these associations do not imply direction or causality! As such, the association of “GlenElgin” to “Scapa” is the same as the association of “Scapa” to “GlenElgin”.
In attempting to examine the data for
distilleryCorrelations, the size problem from
distilleryProfiles is even more magnified - it is an 86 by
86 grid. As we have demonstrated checks for size, and examining the
first few rows, we will omit those steps here.
The most benefit for time is found through examination of a visualization of the data.
This is a very, very large and complex image, and it’s hard to read any of the data points unless you have sufficient space on your display.
For simplicity, we can omit the Distillery labels.
As we can see, our Heatmap() provides a wealth of
information about our distilleries at a glance.
We can check our data against our expectations at a glance: 1. We have an 86 x 86 product, reflecting that each distillery’s flavor profile was correlated with each other distillery’s flavor profile 2. The diagonal represents consistent values of 1, as each distillery has a 1:1 correlation with itself. 3. The data is mirrored symmetrically across the diagonal, as the associations have no direction.
We can see bands that indicate certain distilleries appear to have opposite flavor profiles, and others appear to be strongly clustered.
There is a particular inclination this visual pattern gives us: it appears to be a “checkerboard pattern.” This indicates that there may be “clusters” we can sort these whiskies into, which could serve as a “short hand” or “label” that indicates these flavor profiles.
Scotch whiskies are traditionally clustered by “Region”, with six major regions: Speyside, Highlands, Lowlands, Islands, Campbelltown, and Islay. These regions are geographic, but do they also indicate any association with flavor profile? Do whiskies in the same “Region” have similar flavor profiles, or should we have a different indicator on labels?
We already explored the association between distilleries (by
flavor profile) in distilleryCorrelations. Since
our original data set also included “Regions” for each
whiskey, we can continue our analysis to explore this relationship.
First, lets get an understanding of the value that “Region” provides us. We can do this with a few rBokeh grid visualizations.
It would be tempting to simply use
distillerycorrelations, as this already shows us all of the
associations between our distilleries. However, this
data is not sorted meaningfully by “Region” and would
be difficult to interpret. To solidify this, lets go through the process
of visualization without sorting, and compare it to our sorted
model.
First, we recognized that the Region variable in the original data set indicated these regions. We can convert this column to a factor to reflect that.
Second, we regonize that our linear correlation matrix has 86 colums
and 86 rows, which share the same labels: each of the distilleries, in
order. We can use the built in rep() function of R to
generate an (x,y) grid representing each Distillery
pairing.
We start by converting our Correlations into a list.
We can then build our X and Y values for a rBokeh visualization of
our correlations, based on the default method that R used to turn our
distilleryCorrelation dataframe into a list. Since our
correlations have each Distillery paired with each
Distillery, we can use our Distillery
values to make this process more streamlined with the rep()
function.
rusXs <- rep(whiskey$Distillery,dim(whiskey)[1])
rusYs <- rep(whiskey$Distillery, each=dim(whiskey)[1])
rusData<-data.frame(rusYs,rusXs)
rusData$Correlations<-ruCorr
We want to define colors for each region, as well as specify
white for a correlation less than or equal to 0.7, and light
grey for a correlation greater than 0.7 but from different regions. We
can use the levels() function to get a regions
list of each factor level in the Region
variable. We can make a list of regionColors the same
length. This will let us use the match function to use
these as a dictionary.
We use a for loop to assign a color to each of our X and Y pairs in our data set, and add them into our data set
ruColors <- vector(mode="list")
for(i in seq(dim(distilleryCorrelations)[1])){
#top of nest, i is row number
for(j in seq(dim(distilleryCorrelations)[1])){
#nest in second loop, j is column number
if (distilleryCorrelations[i,j] < 0.7) {
ruColors <- c(ruColors, white)
}
if (distilleryCorrelations[i,j] >= 0.7){
#get distillery for both column and row whiskies
rowWhiskeyRegion <- as.character(whiskey[whiskey$Distillery == colnames(distilleryCorrelations)[i],]$Region)
colWhiskeyRegion <- as.character(whiskey[whiskey$Distillery == colnames(distilleryCorrelations)[j],]$Region)
#determine if regions are equal
if (colWhiskeyRegion==rowWhiskeyRegion){
ruColors <- c(ruColors, regionColors[match(colWhiskeyRegion,regions)])
}
if (colWhiskeyRegion!=rowWhiskeyRegion){
ruColors <- c(ruColors, lightGrey)
}
}
}
}
rusData$Correlations<-ruCorr
rusData$Colors<-ruColors
And finally, we build our rBokeh visualization. Since there are so
many values for distillery, we hide our x and y axes labels for
readability. We ensure those values are in the hover
information. We also set our transparency levels to the strength of the
correlation, providing even more information visually.
rusrBokeh <- figure(title="Whiskies by Region - Default") %>%
ly_crect(rusXs, rusYs, data = rusData, 0.9, 0.9,
fill_color = ruColors, line_color = ruColors, fill_alpha = ruCorr, line_alpha = ruCorr, hover = "Whiskeys: @rusXs, @rusYs. Correlation: @ruCorr" ) %>%
x_axis(label="",visible=FALSE) %>% y_axis(label="", visible=FALSE)
rusrBokeh
On examination, we can see the order of the Distilleries is not the same as it is in our Whiskies list! Rather, rBokeh automatically sorted alphabetically. If we want to impose our own structure, we have to define numeric X and Y coordinates.
ruXs <- rep(seq(1:dim(whiskey)[1]), each=dim(whiskey)[1])
ruYs <- rep(seq(1:dim(whiskey)[1]),dim(whiskey)[1])
ruData<-data.frame(ruXs,ruYs)
ruData$ruYdist <- rep(whiskey$Distillery,dim(whiskey)[1])
ruData$ruXdist <- rep(whiskey$Distillery, each=dim(whiskey)[1])
ruData$Correlations<-ruCorr
rusData$Correlations<-ruCorr
rusData$Colors<-ruColors
rurBokeh <- figure(title="Whiskies by Region - Unsorted") %>%
ly_crect(ruXs, ruYs, data = ruData, 0.9, 0.9,
fill_color = ruColors, line_color = ruColors, fill_alpha = ruCorr, line_alpha = ruCorr, hover = "Whiskeys: @ruXdist, @ruYdist. Correlation: @ruCorr" ) %>%
x_axis(label="",visible=FALSE) %>% y_axis(label="", visible=FALSE)
rurBokeh
NA
As we can see from either visualization, it’s difficult to tell if there truly are any meaningful associations between distilleries that share the same “Region” or not. There are several light grey squares. There appear to be several blue squares. But how can we tell if there are “Region” values that have poor association? This is where sorting the data improves our visualization.
The difference in the power of the visualization comes from the pre-processing of the data. The importance of preparing your data properly for any visualization cannot be overstated. A poor pre-processing will weaken any visual impact, while appropriate pre-processing will strengthen arguments you make from your analysis. Lets investigate how sorting our whiskies by “Region” prior to analysis and visualization will increase the impact and clarity of our visualization.
R has a built in method that can sort our data frame by a single
column’s value: order(). We specify that we would like to
sort our values by “Region”. In inspecting the data, we
can dtermine our transformation was successful. The rows of the data
were sorted as units, so all data associated with a specific whiskey in
our data set remains associated with that same data, it is simply the
order that has changed.
We then proceed with an identical creation of our Linear Correlation Matrix, and obtain the new order of the list of distilleries.
regionalWhiskeys <- whiskey[order(whiskey$Region),]
regionalDistilleryProfiles <- regionalWhiskeys[flavorSubsetCols] %>% rotate_df
regionalDistilleryProfiles <- regionalDistilleryProfiles %>% row_to_names(row_number=1)
regionalDistilleryProfiles <- regionalDistilleryProfiles %>% mutate_if(is.character,as.numeric)
regionalDistilleryCorrelations <- cor(regionalDistilleryProfiles,method="pearson")
We construct our x and y variables, correlation variables, and colors just as we did for our unsorted analysis.
rsCorr<-as.list(regionalDistilleryCorrelations)
rsXs <- rep(seq(1:dim(regionalWhiskeys)[1]), each=dim(regionalWhiskeys)[1])
rsYs <- rep(seq(1:dim(regionalWhiskeys)[1]),dim(regionalWhiskeys)[1])
rsData <- data.frame(rsXs,rsYs)
rsData$rsYdist <- rep(regionalWhiskeys$Distillery,dim(regionalWhiskeys)[1])
rsData$rsXdist <- rep(regionalWhiskeys$Distillery, each=dim(regionalWhiskeys)[1])
rsData$rsCorr<-rsCorr
rsColors <- vector(mode="list")
for(i in seq(dim(regionalDistilleryCorrelations)[1])){
#top of nest, i is row number
for(j in seq(dim(regionalDistilleryCorrelations)[1])){
#nest in second loop, j is column number
if (regionalDistilleryCorrelations[i,j] < 0.7) {
rsColors <- c(rsColors, white)
}
if (regionalDistilleryCorrelations[i,j] >= 0.7){
#get distillery for both column and row whiskies
rowWhiskeyRegion <- as.character(regionalWhiskeys[regionalWhiskeys$Distillery == colnames(regionalDistilleryCorrelations)[i],]$Region)
colWhiskeyRegion <- as.character(regionalWhiskeys[regionalWhiskeys$Distillery == colnames(regionalDistilleryCorrelations)[j],]$Region)
#determine if regions are equal
if (colWhiskeyRegion==rowWhiskeyRegion){
rsColors <- c(rsColors, regionColors[match(colWhiskeyRegion,regions)])
}
if (colWhiskeyRegion!=rowWhiskeyRegion){
rsColors <- c(rsColors, lightGrey)
}
}
}
}
rsData$Colors<-rsColors
And we generate our rBokeh visualization.
rsrBokeh <- figure(title="Whiskies by Region - Sorted") %>%
ly_crect(rsXs, rsYs, data = rsData, 0.9, 0.9,
fill_color = rsColors, line_color = rsColors, fill_alpha = rsCorr, line_alpha = rsCorr, hover = "Whiskies: @rsXdist,@rsYdist. Correlation: @rsCorr") %>%
x_axis(label="",visible=FALSE) %>% y_axis(label="", visible=FALSE)
rsrBokeh
We can see a much clearer picture from this sorted data!
It appears that there are “pockets” of poor correlations within each “Region”, represented by white space. The grey spaces also give us something to think about - are there correlations that are missed by focusing only on “Region”?